Abstract. A perceptron is trained by a random bit sequence. In comparison to the
corresponding classification problem, the storage capacity decreases to $\alpha_c = 1.70 \pm 0.02$
due to correlations between input and output bits. The numerical results are supported

1. Introduction

Artificial neural networks are successful in predicting time series (Weigand et al 1993). 
Given a sequence of real numbers, a multilayer network is able to learn from N 
consecutive numbers the following one. After learning a part of the sequence, the 
network is able to generalize: If N consecutive numbers are taken from the part of 
the sequence which the network has not learned, the network can predict the following 
number to some extent. 

Using methods and models of statistical mechanics, training from a set of examples
and the generalization of neural networks have been studied intensively (Hertz et al 1991,
Kinzel et al 1991, Opper et al 1996). Most work has concentrated on perceptrons
and binary classification problems: a set of $N$-dimensional input vectors is classified
by a perceptron. A different perceptron is trained on this set of examples; after the
training process the network is able to generalize, i.e. it has some overlap with the weights of
the perceptron which generated the examples. If the classification is not performed
by a different perceptron but is assigned randomly, the network can still learn a certain
number of examples. The maximum number of examples which can be classified by a
perceptron is related to the storage capacity of the corresponding attractor networks
(Gardner 1988).

Only recently has this approach been extended to time series analysis (Eisenstein
et al 1995): a perceptron was trained on a series of bits which was produced by a
different perceptron. Hence the generation of time series by a neural network is also
interesting in this context, and recently an analytic solution for a stationary time series
generated by a perceptron has been found (Kanter et al 1995).

It turns out that a perceptron can predict bit sequences very well if they are taken
from stationary time series produced by a different perceptron (Eisenstein et al 1995).
Even a small training set leads to perfect prediction of the rest of the sequence, at
least for $N \to \infty$. However, the overlap between the learning and the generating network is
very small.

In this paper we study the analogue of the storage capacity problem in the context
of bit sequences: A set of $P$ consecutive bits, which are randomly chosen, is repeated
periodically (or placed on a ring). A perceptron with $N < P$ is trained on this bit
sequence, where the output bit is given by the bit which follows the $N$ input bits. Hence,
the only difference from the examples used for the classification problem is the correlation
between input and output: the output bit is contained in the input of $N$ examples.

In Section 2 we introduce the bit sequence used for training a perceptron, which
is simulated in Section 3. Section 4 presents a signal-to-noise analysis of the
Hebbian learning rule. A general Boolean function is considered in Section 5, and the
last section contains a summary and the conclusions.



2. Bit sequence 

$P$ bits $S_i \in \{-1, 1\}$, $i = 1, \ldots, P$, are chosen randomly and independently. This sequence
is repeated periodically from $i = -\infty$ to $i = \infty$ (or, equivalently, placed on a ring). $N$
consecutive bits are used as an input to a perceptron with weights $w_j \in \mathbb{R}$, $j = 1, \ldots, N$
(see Figure 1):

$$\sigma^\nu = \mathrm{sign}\left( \sum_{j=1}^{N} w_j \xi_j^\nu \right) \quad \text{with} \quad \xi_j^\nu = S_{j-1+\nu}. \qquad (1)$$

The problem we are addressing here is the following: Can we find a weight vector
$w = (w_1, \ldots, w_N)$ which reproduces the next bits in the sequence, i.e.

$$\sigma^\nu = S_{\nu+N} \quad \text{for all } \nu \in \mathbb{N}. \qquad (2)$$
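
The construction of the training patterns from the periodic sequence can be sketched as follows. This is a minimal illustration, not the code used for the simulations; the function names `ring_patterns` and `reproduces_sequence` are hypothetical, and indices are 0-based, so that `xi[nu, j]` corresponds to $S_{\nu+j}$ and `sigma[nu]` to $S_{\nu+N}$, all taken modulo $P$.

```python
import numpy as np

def ring_patterns(bits, n_inputs):
    """Build inputs xi and targets sigma from a periodic bit sequence.

    xi[nu, j] = S_{nu + j} and sigma[nu] = S_{nu + N}; indices are 0-based
    and taken modulo the period P (the sequence is placed on a ring).
    """
    bits = np.asarray(bits)
    period = len(bits)
    idx = (np.arange(period)[:, None] + np.arange(n_inputs)[None, :]) % period
    xi = bits[idx]
    sigma = bits[(np.arange(period) + n_inputs) % period]
    return xi, sigma

def reproduces_sequence(weights, xi, sigma):
    """True if sign(w . xi^nu) equals sigma^nu for every pattern nu."""
    return bool(np.all(np.sign(xi @ weights) == sigma))

rng = np.random.default_rng(0)           # the seed is an arbitrary choice
bits = rng.choice([-1, 1], size=7)       # P = 7 random bits
xi, sigma = ring_patterns(bits, n_inputs=4)
```

Each target `sigma[nu]` is also the last input bit of the following pattern, which is exactly the input-output correlation studied below.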

In particular we are interested in the maximal number $P_c(N)$ of bits which can be
reproduced correctly by a perceptron for $N \to \infty$; as usual we define

$$\alpha = P/N; \qquad \alpha_c = \lim_{N \to \infty} \frac{P_c(N)}{N}. \qquad (3)$$

There exist mathematical theorems about the number of configurations $\{\sigma^\nu\}$ which can
be realized by Eq. (1), which are already more than 140 years old (Schläfli 1950, Cover
1965): If the $P$ input vectors $\xi^\nu = (\xi_1^\nu, \ldots, \xi_N^\nu)$ are in general position, i.e. if any subset
of $N$ vectors is linearly independent, then the number $C(P, N)$ of possible configurations
$\{\sigma^\nu\} \in \{+1, -1\}^P$ is given by

$$C(P, N) = 2 \sum_{i=0}^{N-1} \binom{P-1}{i}. \qquad (4)$$




In our case of the random bit sequence we expect the input vectors to be in
general position. For $P < N$ one obtains $C(P, N) = 2^P$; hence, any bit sequence with
$P < N$ can be perfectly predicted by a perceptron. For $P < 2N$ there is still a large
fraction of configurations which is given by Eq. (1); this fraction goes to one for $N \to \infty$.
This means that for random configurations $\{\sigma^\nu\}$ the probability to map them by a
perceptron is one in the limit of $N \to \infty$. For $P > 2N$ this probability is zero. Hence,
for a perceptron and random examples one finds $\alpha_c = 2$ (Gardner 1988).

However, in our case the configurations $\{\sigma^\nu\}$ are not randomly chosen but taken
from the input vectors. Each output bit $\sigma^\nu$ appears in the $N$ input vectors $\xi^{\nu+1}, \ldots, \xi^{\nu+N}$,
too. There are correlations between the input vectors and the output bits. In addition,
only the fraction of configurations $\{\sigma^\nu\}$ which cannot be reproduced by a perceptron goes
to zero for $N \to \infty$ and $N < P < 2N$; their number is still increasing exponentially
with $N$. For instance, for $N = 100$ and $\alpha = 1.8$ Eq. (4) gives about $10^{54}$ configurations
which are not linearly separable, that is 6.7% of all of the $2^{180}$ possible ones. On the other
hand, for $P > 2N$ the number of configurations which can be reproduced by a perceptron
still increases exponentially with $N$, although their fraction vanishes. Hence, it is not
obvious whether the patterns given by a bit sequence belong to the first or second class,
which means whether $\alpha_c < 2$ or $\alpha_c > 2$.
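
The counting argument can be checked directly by evaluating the Cover formula, Eq. (4), with exact integer arithmetic. The sketch below is an illustration only; the function names are hypothetical, and the formula used in the code is the standard result $C(P, N) = 2 \sum_{i=0}^{N-1} \binom{P-1}{i}$ for vectors in general position.

```python
from fractions import Fraction
from math import comb

def cover_count(P, N):
    """Number of linearly separable configurations of P points in N dimensions.

    Uses C(P, N) = 2 * sum_{i=0}^{N-1} binom(P-1, i), valid for general position.
    """
    return 2 * sum(comb(P - 1, i) for i in range(N))

def separable_fraction(P, N):
    """Fraction of all 2^P output configurations that are linearly separable."""
    return Fraction(cover_count(P, N), 2 ** P)

# The case quoted in the text: N = 100, alpha = 1.8, hence P = 180.
not_separable = 1 - separable_fraction(180, 100)
print(f"fraction not separable: {float(not_separable):.4f}")
```

For $P \le N$ the function returns $2^P$, and at $P = 2N$ exactly half of all configurations are separable, which is the finite-$N$ origin of $\alpha_c = 2$.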

In the uncorrelated case the storage capacity $\alpha_c$ has been calculated using the replica
method (Gardner 1988). Correlations between the input vectors do not change the result
$\alpha_c = 2$. Only if there is a bias for both the output bits and the input bits does the storage
capacity $\alpha_c$ increase with the bias. If the patterns are anticorrelated, $\alpha_c$ can also be lower
than 2 (Lopez et al 1995).

For our problem we have formulated the version space of weights in terms of replicas.
One has to average over only $P$ random bits, instead of $P \cdot N$ in the uncorrelated case.
However, we did not succeed in getting rid of the correlations and could not solve the
integral. Therefore, we have studied the bit sequence numerically.



3. Perceptron: Simulations 

To calculate the storage capacity $\alpha_c$ of the perceptron being trained by a random bit
sequence, we have used two methods:

(i) We have used several routines which try to minimize the number of errors and
indicate whether they succeeded or not. Hence, we obtained a fraction $f(\alpha, N)$ of
patterns for which the routine could find a solution. The capacity $\alpha_c(N)$ is defined
by $f(\alpha_c, N) = 1/2$. Obviously we obtain only a lower bound for the true $\alpha_c$.
The results did not depend on the actual algorithm within the expected error
bounds.

We have used a routine that minimizes the "linear cost function" $E = \sum_{\nu=1}^{P} \Theta(1 - E^\nu)(1 - E^\nu)$
with $E^\nu = \frac{1}{\sqrt{N}} \sum_{j=1}^{N} w_j \xi_j^\nu \sigma^\nu$ (without constraining the vector $w$).

(ii) The other estimate uses the median learning time (Priel et al 1994). For random
patterns the average learning time $\tau_a$ of the perceptron algorithm diverges as
$\tau_a^{-1/2} \sim (\alpha_c - \alpha)$ for $\alpha \to \alpha_c$ (Opper 1988). We use this power law in our case, too.
The median $\tau_m$ of the distribution of learning times is calculated for $\alpha < \alpha_c$, and $\alpha_c$
is obtained from a fit to the power-law divergence. This method has the advantage
that one does not have to determine whether a pattern cannot be learned at all. If
the number of learning steps is larger than the median the algorithm can stop; this
saves a large amount of computer time.
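
Method (ii) can be illustrated with a short sketch. It assumes the standard perceptron learning rule started from zero weights and counts weight updates as the learning time; the actual routines, step sizes and cutoffs used for the simulations are not specified in the text, so every name and parameter below is hypothetical. Runs that do not converge within `max_sweeps` are recorded as infinite, which leaves the median unchanged as long as more than half of the samples converge.

```python
import numpy as np

def ring_patterns(bits, n_inputs):
    """xi[nu, j] = S_{nu+j}, sigma[nu] = S_{nu+N}, indices modulo the period."""
    bits = np.asarray(bits)
    period = len(bits)
    idx = (np.arange(period)[:, None] + np.arange(n_inputs)[None, :]) % period
    return bits[idx], bits[(np.arange(period) + n_inputs) % period]

def perceptron_learning_time(xi, sigma, max_sweeps):
    """Number of weight updates until all patterns are stable, or None."""
    n_patterns, n_inputs = xi.shape
    w = np.zeros(n_inputs)
    updates = 0
    for _ in range(max_sweeps):
        changed = False
        for nu in range(n_patterns):
            if sigma[nu] * (w @ xi[nu]) <= 0:      # pattern not yet stored
                w += sigma[nu] * xi[nu] / n_inputs
                updates += 1
                changed = True
        if not changed:
            return updates
    return None

def median_learning_time(n_inputs, period, n_samples, max_sweeps, rng):
    """Median learning time over random periodic sequences (inf if unlearned)."""
    times = []
    for _ in range(n_samples):
        bits = rng.choice([-1, 1], size=period)
        t = perceptron_learning_time(*ring_patterns(bits, n_inputs), max_sweeps)
        times.append(np.inf if t is None else t)
    return float(np.median(times))

tau_m = median_learning_time(20, 24, n_samples=11, max_sweeps=500,
                             rng=np.random.default_rng(1))
```

An estimate of $\alpha_c$ would then follow from a linear fit of $\tau_m^{-1/2}$ against $\alpha$ and its extrapolation to zero.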

Figure 2 and Table 1 show the results of the simulations†. In the uncorrelated case
both of the methods give the exact result $\alpha_c = 2$ within the statistical error,
already for $N = 100$. If we use the input from the bit sequence but random output bits
the results agree with $\alpha_c = 2$, too. However, if in addition we use the output bits from
the bit sequence we obtain $\alpha_c = 1.70 \pm 0.02$. Hence, the correlations between output
bits and input vectors decrease the storage capacity. For the perceptron it is harder
to learn a random bit sequence than a random classification problem. This is due to
the correlations between input and output but not due to the correlations between the
input vectors.

If a perceptron which has learned a bit sequence perfectly is used as a bit generator,
then any initial state of $N$ bits taken from the sequence reproduces the complete
sequence. Hence the sequence is an attractor of the bit generator. However, we found
that the basin of attraction is very small: if only one bit is flipped in the initial state,
then there is a high probability that the generator runs into a different sequence.
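
The bit generator can be sketched as a shift register whose new bit is the perceptron output for the current window. This is a minimal illustration with a hypothetical function name; mapping a vanishing field to $-1$ is an arbitrary convention chosen here, not something stated in the text.

```python
import numpy as np

def generate_bits(weights, initial_state, n_steps):
    """Iterate the perceptron as a bit generator.

    The state holds the last N bits; each step appends sign(w . state)
    and drops the oldest bit. A field of exactly zero is mapped to -1.
    """
    state = list(initial_state)
    produced = []
    for _ in range(n_steps):
        field = float(np.dot(weights, state))
        new_bit = 1 if field > 0 else -1
        produced.append(new_bit)
        state = state[1:] + [new_bit]
    return produced

# Smallest example: N = 1, w = (-1) reproduces the period-2 sequence (+1, -1).
sequence = generate_bits([-1.0], [1], n_steps=4)
```

To probe the basin of attraction one would start from a window of the learned sequence with a single bit flipped and compare the generated bits with the original sequence.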

We have also studied two additional problems: 

(i) The $P$ random bits are not repeated periodically but the perceptron is trained with
a string of $N + P$ random bits. Hence, there are still $P$ patterns but an output bit
belongs only to part of the other input patterns. On average the correlations are
weaker. Indeed, we find that the storage capacity $\alpha_c = 1.82 \pm 0.02$ is larger than
the one for the periodic sequence.

(ii) With a bias $m = \frac{1}{P} \sum_{i=1}^{P} S_i$ in the bit sequence, the storage capacity increases. This
is similar to the random classification problem (Gardner 1988).

† Preliminary results have been reported in 1994 by Bork.

4. Perceptron: Hebbian learning rule 

In order to get some insight from analytic calculations we now consider the Hebbian
learning rule

$$w_j = \frac{1}{N} \sum_{\nu=1}^{P} \sigma^\nu \xi_j^\nu. \qquad (5)$$

Output bits $\sigma^\nu$ and input vectors $\xi^\nu$ are taken from a bit sequence $\{S_i\}$, Eqs. (1) and
(2). It is known that the Hebbian weights cannot map the examples perfectly. However,
the training error can be calculated from a signal-to-noise analysis (see for instance
Hertz et al 1991). The sign of the following stability $E^\nu$ shows whether an example is
classified correctly:

$$E^\nu = \sigma^\nu \, w \cdot \xi^\nu = \frac{1}{N} \sum_{j=1}^{N} \sum_{\mu=1}^{P} \sigma^\nu \sigma^\mu \xi_j^\mu \xi_j^\nu. \qquad (6)$$

The fraction of negative values of $E^\nu$ defines the training error.

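We calculated the first two moments $\langle E \rangle$ and $\langle E^2 \rangle$ of $E^\nu$, where $\langle \ldots \rangle$ means an
average over the distribution of the examples, i.e. over all realizations of the bit sequence.
If all bits $\sigma^\nu$ and $\xi_j^\nu$ are random one has

$$\langle \sigma^\nu \sigma^\mu \rangle = \delta_{\nu\mu}; \qquad \langle \xi_i^\nu \xi_j^\mu \rangle = \delta_{\nu\mu} \delta_{ij}. \qquad (7)$$

This gives

$$\langle E \rangle = 1; \qquad \langle E^2 \rangle = 1 + \alpha.$$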

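In the limit $N \to \infty$ the values of $E^\nu$ are Gaussian distributed with mean 1 and standard
deviation $\sqrt{\alpha}$. However, for the periodic bit sequence, Eqs. (1) and (2), the values of
$\sigma^\nu$ and $\xi_j^\nu$ are taken from the random bits $S_i$. For instance, $\sigma^\nu$ is identical with $\xi_j^\mu$ for
$j = 1, \ldots, N$ and $\mu = \nu + N + 1 - j$. Taking this into account we find for $1 < \alpha < 2$:

$$\langle E \rangle = \begin{cases} 1 + \frac{1}{N} & \text{for } P \text{ even} \\ 1 & \text{for } P \text{ odd} \end{cases}
\qquad
\langle E^2 \rangle = \begin{cases} 2 + \alpha + \frac{6 - \alpha}{N} - \frac{4}{N^2} & \text{for } P \text{ even} \\ 2 + \alpha - \frac{2}{N} & \text{for } P \text{ odd} \end{cases} \qquad (8)$$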



For $\alpha > 2$ the results above for odd $P$ hold for even ones, too. Hence for $N \to \infty$
the standard deviation of the $E^\nu$ values is $\sqrt{1 + \alpha}$ instead of $\sqrt{\alpha}$ in the uncorrelated
case. The correlations increase the noise relative to the signal. Assuming a Gaussian
distribution of the $E^\nu$ values in the limit $N \to \infty$, which is supported by our numerical
simulation, we obtain the training error $\epsilon_t$ as

$$\epsilon_t = \Phi\left( -\frac{1}{\sqrt{1 + \alpha}} \right) \qquad (9)$$

with the error function

$$\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-y^2/2} \, dy. \qquad (10)$$



If the random bits are not repeated periodically, but arranged linearly as discussed
above, the moments depend on the number $\nu$ of the pattern. If $\nu = 1$ is the first and
$\nu = P$ is the last pattern, we define

In this case the training error depends on $\gamma$ and we find

e « = *("7ST7)- (12) 

Figure 3 shows the training error $\epsilon_t(\alpha)$ for the uncorrelated bits and the periodic
bit sequence. In the latter case $\epsilon_t$ is averaged over the patterns. The correlations of the
bit sequence increase the training error, in agreement with the decrease of the storage
capacity shown in the previous section.
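
For small $N$ and $P$ the moments of the Hebbian stabilities on the ring can be obtained by exact enumeration of all $2^P$ bit sequences, which gives an independent check of the signal-to-noise analysis. The sketch below is illustrative only. The function names are hypothetical, the closed forms in `ring_moments_formula` are the even/odd expressions for $1 < \alpha < 2$ as read from Eq. (8), and `gaussian_training_error` assumes the Gaussian estimate $\Phi(-\langle E \rangle / \text{std})$ with mean 1 and variance $1 + \alpha$ (ring) or $\alpha$ (uncorrelated).

```python
from fractions import Fraction
from itertools import product
from math import erf, sqrt

def hebb_moments(N, P):
    """Exact <E> and <E^2>, averaged over all 2^P periodic sequences and all nu."""
    sum_e, sum_e2, count = 0, 0, 0
    for bits in product((-1, 1), repeat=P):
        # 0-based patterns: xi[nu][j] = S_{nu+j}, sigma[nu] = S_{nu+N} (mod P)
        xi = [[bits[(nu + j) % P] for j in range(N)] for nu in range(P)]
        sigma = [bits[(nu + N) % P] for nu in range(P)]
        # N times the Hebbian weights: N * w_j = sum_mu sigma^mu xi_j^mu
        nw = [sum(sigma[mu] * xi[mu][j] for mu in range(P)) for j in range(N)]
        for nu in range(P):
            ne = sigma[nu] * sum(nw[j] * xi[nu][j] for j in range(N))  # N * E^nu
            sum_e += ne
            sum_e2 += ne * ne
            count += 1
    return Fraction(sum_e, count * N), Fraction(sum_e2, count * N * N)

def ring_moments_formula(N, P):
    """Closed forms for N < P < 2N, as read from Eq. (8)."""
    alpha = Fraction(P, N)
    if P % 2 == 0:
        return 1 + Fraction(1, N), 2 + alpha + (6 - alpha) / N - Fraction(4, N * N)
    return Fraction(1), 2 + alpha - Fraction(2, N)

def gaussian_training_error(alpha, ring=True):
    """Phi(-1/sqrt(variance)) with variance 1 + alpha (ring) or alpha (random)."""
    variance = 1 + alpha if ring else alpha
    x = -1 / sqrt(variance)
    return 0.5 * (1 + erf(x / sqrt(2)))

for N, P in [(2, 3), (3, 4)]:
    print(N, P, hebb_moments(N, P), ring_moments_formula(N, P))
```

The enumeration is exponential in $P$ and is therefore only practical for very short sequences; it is meant as a consistency check, not as a replacement for the simulations of Figure 3.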



5. General Boolean function 



Up to now we have restricted our map to a perceptron. We expect that multilayer
networks can reproduce a longer bit sequence, in accordance with the higher storage
capacity of the committee machine (Priel et al 1994). In this section we study the
storage capacity of a general Boolean function $b : \{+1, -1\}^N \to \{+1, -1\}$, which is the
size of the random bit sequence with period $P$ which can be reproduced by a Boolean
function $b$, i.e.

$$b(S_\nu, \ldots, S_{\nu+N-1}) = S_{\nu+N}; \quad \nu = 1, \ldots, P. \qquad (13)$$

Since we have the freedom to choose for any input configuration $(S_\nu, \ldots, S_{\nu+N-1})$
an arbitrary output bit $S_{\nu+N}$, our problem reduces to the question whether all of the input
configurations are different from each other. If all $(S_\nu, \ldots, S_{\nu+N-1})$ are different then
we can define a Boolean function which maps each of those states to the corresponding
bit $S_{\nu+N}$. For the rest of the $2^N - P$ input states we have the freedom to choose an
arbitrary output bit; hence in this case, there are $2^{(2^N - P)}$ Boolean functions which
map the bit sequence correctly.

If two of the input configurations $(S_\nu, \ldots, S_{\nu+N-1})$ are identical there is still a
probability of 1/2 that the two output bits are different, too. To get an analytic estimate
for the size of a random bit sequence which can be reproduced by a Boolean function
we neglect correlations between the input configurations. That means we consider $P$
configurations $(S_1^\nu, \ldots, S_N^\nu)$; $\nu = 1, \ldots, P$, where all of the bits $S_j^\nu$ are chosen randomly
and independently. We want to calculate the probability $f$ that all of the $P$ states are
pairwise different. There are $2^N$ possible states. The first configuration $\nu = 1$
can be any of those states. The second one can take any of the $2^N - 1$ remaining states,
etc. Hence, the number $C$ of allowed configurations is

$$C = 2^N (2^N - 1)(2^N - 2) \cdots (2^N - P + 1) \qquad (14)$$

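which gives

$$\ln C = \sum_{\nu=1}^{P} \ln(2^N - \nu + 1) = \sum_{\nu=1}^{P} \left[ N \ln 2 + \ln\left( 1 - \frac{\nu - 1}{2^N} \right) \right] \qquad (15)$$

$$= P N \ln 2 + \sum_{\nu=1}^{P} \ln\left( 1 - \frac{\nu - 1}{2^N} \right). \qquad (16)$$

If $P \ll 2^N$ we can expand the logarithm and obtain

$$\ln C \simeq P N \ln 2 - \frac{1}{2^N} \frac{P(P - 1)}{2}. \qquad (17)$$

Since the total number of all possible configurations is $2^{PN}$, the fraction of the
allowed ones is

$$f \simeq \exp\left( -\frac{P(P - 1)}{2^{N+1}} \right). \qquad (18)$$

We define the average period $P_c$ by $f(P_c) = 1/2$ and obtain for large $N$

$$P_c = \sqrt{2 \ln 2 \cdot 2^N}. \qquad (19)$$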

Hence we expect that the average length of the bit sequence which can be reproduced
by a Boolean function scales as the square root of $2^N$. In fact our problem is similar to
the random map, where the average cycle length has the same scaling property (Harris
1960, Derrida et al 1987).
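
The estimate for uncorrelated configurations can be checked numerically by comparing the exact product $C / 2^{PN}$ from Eq. (14) with the exponential approximation and with the resulting critical period. The sketch below is an illustration under the independence assumption made in the text; the function names are hypothetical.

```python
from math import exp, log, sqrt

def fraction_exact(P, N):
    """Probability that P independent random N-bit states are pairwise different."""
    f = 1.0
    for k in range(P):                 # factor (2^N - k) / 2^N for k = 0 .. P-1
        f *= 1 - k / 2 ** N
    return f

def fraction_approx(P, N):
    """Leading-order approximation exp(-P (P - 1) / 2^(N + 1)) for P << 2^N."""
    return exp(-P * (P - 1) / 2 ** (N + 1))

def critical_period(N):
    """P_c defined by f(P_c) = 1/2 for large N: sqrt(2 ln 2) * 2^(N/2)."""
    return sqrt(2 * log(2)) * 2 ** (N / 2)

N = 20
P_c = round(critical_period(N))
print(P_c, fraction_exact(P_c, N), fraction_approx(P_c, N))
```

This is the same calculation as the classical birthday problem with $2^N$ possible "birthdays", which explains the square-root scaling.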

The configurations taken from a random bit sequence are correlated, since
consecutive configurations are obtained by shifting a window of $N$ bits over the sequence.
However, our numerical simulations show that these correlations do not change the
scaling law Eq. (19). For a given sequence with $P$ bits the size $N$ of the window is
increased until this sequence can be reproduced by a Boolean function. $N_c$ is defined
as the window size $N$ where 50% of the sequences are reproduced. In Figure 4, $P$ is
shown as a function of $N_c$. For $P \le 17$, $N_c$ is determined by exhaustive enumeration.
For larger $P$ values $N_c$ is estimated from up to $10^5$ random samples. The log-linear plot
shows that the data are consistent with

$$P = 1.6 \times 2^{N_c/2}. \qquad (20)$$

The comparison with Eq. (19) shows that the correlations seem to change the
prefactor from 1.17 to 1.6, but the number still increases with the square root of $2^N$,
the size of the input space.
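
The test whether a given periodic sequence can be reproduced by some Boolean function of $N$ inputs amounts to checking that identical windows are never followed by different bits. A minimal sketch, with hypothetical function names, could look as follows.

```python
def is_reproducible(bits, N):
    """True if some Boolean function of N inputs reproduces the periodic sequence.

    This holds exactly when no two identical windows of N consecutive bits
    (indices modulo the period) are followed by different bits.
    """
    P = len(bits)
    table = {}
    for nu in range(P):
        window = tuple(bits[(nu + j) % P] for j in range(N))
        target = bits[(nu + N) % P]
        if table.setdefault(window, target) != target:
            return False
    return True

def minimal_window(bits):
    """Smallest window size N for which the sequence is reproducible."""
    N = 1
    while not is_reproducible(bits, N):
        N += 1
    return N

print(minimal_window([1, 1, -1]))
```

Averaging the outcome of `is_reproducible` over many random sequences of length $P$ for increasing $N$ would give the fraction of reproduced sequences from which $N_c$ is read off at 50%.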

6. Summary 

A perceptron with $N$ input bits has been trained by a random bit sequence with a period
$P$. Each output bit is contained in $N$ input vectors. These correlations decrease the
storage capacity to $\alpha_c = 1.70 \pm 0.02$ compared to $\alpha_c = 2$ for uncorrelated output bits.
For the corresponding bit generator the bit sequence has a tiny basin of attraction.

An analysis of Hebbian weights shows that a bit sequence gives a larger noise-to-signal
ratio than a random classification problem. This result is in agreement with the
lower storage capacity.

If a general Boolean function is trained by the random bit sequence, the maximal
period $P$ scales as the square root of $2^N$, the size of the input space.




Figure 1. A perceptron learning a periodic time series. The desired output of the 
perceptron (marked) is the next bit of the series and therefore part of other input 
patterns as well. 



Figure 2. Left-hand side: The probability $f$ of a bit sequence to be linearly separable
as a function of $\alpha = P/N$. The sequence is constructed from $P$ random bits which are
repeated periodically. The simulations are performed for a perceptron with $N = 100$
input bits and $f$ is averaged over at least 50 sets of patterns. Right-hand side: The
median learning time to the power of $-1/2$ as a function of $\alpha$. The size of the perceptron
is $N = 400$, and $\tau$ is averaged over 1000 sets of patterns. The line is a least-squares fit
to the data.

Figure 3. The training error of Hebbian weights for different topologies. The inputs
are chosen binary. O: random patterns and x: patterns from a random bit sequence
with periodic boundary condition. The simulations were done for $N = 200$ and
averaged over 100 samples each. The lines show the theoretical results.

Figure 4. The length of a cycle that is learnable by a Boolean function as a function
of $N_c$. The values up to $P = 17$ are exact. The values up to $P = 100$ are averaged
over 100000 samples, for $P = 251$ over 50000, for $P = 503$ over 1000 and for $P = 1007$
over 100 samples. The error bars are given. The line shows $P = \exp(0.5) \, 2^{0.5 N_c}$.

Table 1. The storage capacity of a perceptron learning different tasks. Measured with
(1) half-error and (2) median learning-time method.

                           method 1        method 2
random, N = 100            1.99 ± 0.01     1.995 ± 0.01
time series, N = 100       1.80 ± 0.025    1.82 ± 0.02
time series, N = 400                       1.82 ± 0.02
ring, N = 100              1.7 ± 0.025     1.69 ± 0.01
ring, N = 400                              1.7 ± 0.02
ring (rnd out), N = 100    1.98 ± 0.05     1.99 ± 0.01
ring (rnd out), N = 400                    1.98 ± 0.02

magnetization m = 0.4, N = 100
ring                       1.95 ± 0.05     1.95 ± 0.03
random                     2.25 ± 0.05